refactor(context): deduplicate register/read option-building logic#1479
refactor(context): deduplicate register/read option-building logic#1479mesejo wants to merge 1 commit intoapache:mainfrom
Conversation
Extract shared helpers (convert_partition_cols, convert_file_sort_order, build_parquet/json/avro_options, convert_csv_options), standardize path types to &str, and remove redundant intermediate variables.
fe84781 to
39e190f
Compare
There was a problem hiding this comment.
Pull request overview
Refactors PySessionContext read/register APIs to reduce duplicated option-building logic across Parquet/CSV/JSON/Avro operations.
Changes:
- Extracted shared helper functions to convert partition columns / sort order and to build Parquet/JSON/Avro/Csv read options.
- Simplified read/register implementations by removing intermediate variables and consolidating async wait patterns.
- Standardized some path parameters from
PathBufto&strfor JSON/Avro APIs.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| pub fn register_json( | ||
| &self, | ||
| name: &str, | ||
| path: PathBuf, | ||
| path: &str, | ||
| schema: Option<PyArrowType<Schema>>, |
There was a problem hiding this comment.
Changing path from PathBuf to &str in this #[pymethods] API is a breaking change for Python callers that previously passed pathlib.Path / other os.PathLike objects (PyO3 can extract those into PathBuf, but not into &str). Consider keeping PathBuf (and moving the to_str() conversion into a helper), or accept a Bound<'_, PyAny> and explicitly support both str and os.PathLike inputs to preserve the existing Python surface area.
| #[allow(clippy::too_many_arguments)] | ||
| #[pyo3(signature = (path, schema=None, schema_infer_max_records=1000, file_extension=".json", table_partition_cols=vec![], file_compression_type=None))] | ||
| pub fn read_json( | ||
| &self, | ||
| path: PathBuf, | ||
| path: &str, | ||
| schema: Option<PyArrowType<Schema>>, |
There was a problem hiding this comment.
read_json now takes path: &str instead of PathBuf, which likely removes support for pathlib.Path / os.PathLike inputs in the Python API. If this is intended, it should be treated as user-facing and called out; otherwise keep the previous PathBuf signature (or accept a generic Python object and convert via os.fspath).
There was a problem hiding this comment.
I think this one is a valid concern, including the other two places copilot found
| #[allow(clippy::too_many_arguments)] | ||
| #[pyo3(signature = (name, | ||
| path, | ||
| schema=None, | ||
| file_extension=".avro", | ||
| table_partition_cols=vec![]))] | ||
| pub fn register_avro( | ||
| &self, | ||
| name: &str, | ||
| path: PathBuf, | ||
| path: &str, | ||
| schema: Option<PyArrowType<Schema>>, |
There was a problem hiding this comment.
Same issue as JSON: changing register_avro's path parameter from PathBuf to &str breaks Python callers that pass pathlib.Path/os.PathLike. Prefer keeping PathBuf (or explicitly supporting os.fspath conversion) to avoid an API regression.
| pub fn register_json( | ||
| &self, | ||
| name: &str, | ||
| path: PathBuf, | ||
| path: &str, | ||
| schema: Option<PyArrowType<Schema>>, | ||
| schema_infer_max_records: usize, | ||
| file_extension: &str, | ||
| table_partition_cols: Vec<(String, PyArrowType<DataType>)>, | ||
| file_compression_type: Option<String>, | ||
| py: Python, | ||
| ) -> PyDataFusionResult<()> { |
There was a problem hiding this comment.
The PR description says "No user-facing changes", but the switch from PathBuf to &str for JSON/Avro read/register methods is user-facing for Python (it changes which argument types are accepted, e.g., pathlib.Path). Either restore the previous behavior or update the PR description / changelog notes accordingly.
timsaucer
left a comment
There was a problem hiding this comment.
Would you mind merging main and giving register_arrow and read_arrow the similar treatment?
|
Nice cleanup! |
Which issue does this PR close?
N/A
Rationale for this change
Reduce the number of execution paths for register and read functions
What changes are included in this PR?
Extract shared helpers (convert_partition_cols, convert_file_sort_order, build_parquet/json/avro_options, convert_csv_options), standardize path types to &str, and remove redundant intermediate variables.
Are there any user-facing changes?
No